Predefined Sparseness in Recurrent Sequence Models
Inducing sparseness while training neural networks has been shown to yield
models with a lower memory footprint but similar effectiveness to dense models.
However, sparseness is typically induced starting from a dense model, and thus
this advantage does not hold during training. We propose techniques to enforce
sparseness upfront in recurrent sequence models for NLP applications, to also
benefit training. First, in language modeling, we show how to increase hidden
state sizes in recurrent layers without increasing the number of parameters,
leading to more expressive models. Second, for sequence labeling, we show that
word embeddings with predefined sparseness lead to similar performance as dense
embeddings, at a fraction of the number of trainable parameters.Comment: the SIGNLL Conference on Computational Natural Language Learning
(CoNLL, 2018
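To illustrate the parameter-budget argument in this abstract, the sketch below compares a dense recurrent weight matrix against a hypothetical block-diagonal "predefined sparse" layout that supports a larger hidden state at the same parameter count. The block-diagonal scheme is an assumed stand-in for illustration, not necessarily the paper's exact construction.

```python
# Hypothetical sketch: a larger hidden state at the same parameter budget,
# via a predefined block-diagonal sparsity pattern on the recurrent matrix.
# The specific layout here is illustrative, not the paper's exact scheme.

def dense_param_count(h):
    # A dense h x h recurrent weight matrix uses h * h parameters.
    return h * h

def block_diagonal_param_count(h_large, num_blocks):
    # Split a larger hidden state into independent dense blocks of size
    # (h_large / num_blocks) each; only the blocks hold parameters.
    b = h_large // num_blocks
    return num_blocks * b * b

# Dense model, hidden size 512:
dense = dense_param_count(512)                 # 512 * 512 = 262144
# Block-diagonal model, hidden size 1024 in 4 blocks -- same budget:
sparse = block_diagonal_param_count(1024, 4)   # 4 * 256 * 256 = 262144
assert dense == sparse
```

The point the abstract makes is visible here: the sparse layout doubles the hidden state size (512 to 1024) without adding a single trainable parameter.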
Dual Rectified Linear Units (DReLUs): A Replacement for Tanh Activation Functions in Quasi-Recurrent Neural Networks
In this paper, we introduce a novel type of Rectified Linear Unit (ReLU),
called a Dual Rectified Linear Unit (DReLU). A DReLU, which comes with an
unbounded positive and negative image, can be used as a drop-in replacement for
a tanh activation function in the recurrent step of Quasi-Recurrent Neural
Networks (QRNNs; Bradbury et al., 2017). Similar to ReLUs, DReLUs are less
prone to the vanishing gradient problem, they are noise robust, and they induce
sparse activations.
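A minimal sketch of the activation described above, assuming the form DReLU(a, b) = max(0, a) − max(0, b), i.e. the difference of two rectifications over a pair of inputs; consult the paper for the exact formulation.

```python
# Sketch of a Dual Rectified Linear Unit: a two-input function with an
# unbounded positive and negative image. Assumed form:
#   DReLU(a, b) = max(0, a) - max(0, b)

def relu(x):
    return max(0.0, x)

def drelu(a, b):
    # Difference of two rectifications: the output ranges over all reals,
    # and is exactly zero when both inputs are non-positive (sparsity).
    return relu(a) - relu(b)

print(drelu(2.0, 0.5))    # 1.5  -> positive image
print(drelu(0.5, 2.0))    # -1.5 -> negative image
print(drelu(-1.0, -3.0))  # 0.0  -> sparse activation
```

Unlike tanh, this output is unbounded in both directions, which is what lets it replace tanh in the recurrent step while keeping ReLU-like gradient behavior.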
We independently reproduce the QRNN experiments of Bradbury et al. (2017) and
compare our DReLU-based QRNNs with the original tanh-based QRNNs and Long
Short-Term Memory networks (LSTMs) on sentiment classification and word-level
language modeling. Additionally, we evaluate on character-level language
modeling, showing that we are able to stack up to eight QRNN layers with
DReLUs, thus making it possible to improve the current state-of-the-art in
character-level language modeling over shallow architectures based on LSTMs.
Learning When Not to Answer: A Ternary Reward Structure for Reinforcement Learning based Question Answering
In this paper, we investigate the challenges of using reinforcement learning
agents for question-answering over knowledge graphs for real-world
applications. We examine the performance metrics used by state-of-the-art
systems and determine that they are inadequate for such settings. More
specifically, they do not evaluate the systems correctly for situations when
there is no answer available and thus agents optimized for these metrics are
poor at modeling confidence. We introduce a simple new performance metric for
evaluating question-answering agents that is more representative of practical
usage conditions, and optimize for this metric by extending the binary reward
structure used in prior work to a ternary reward structure which also rewards
an agent for not answering a question rather than giving an incorrect answer.
We show that this can drastically improve the precision of answered questions
while sacrificing only a small number of previously correctly answered
questions. Employing a supervised learning strategy using depth-first-search
paths to bootstrap the reinforcement learning algorithm further improves
performance.
Comment: Accepted at NAACL 2019. Version 1 was presented at the NIPS 2018 Workshop on Relational Representation Learning.
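The ternary reward structure described in this abstract can be sketched as follows; the specific reward values are illustrative assumptions, not the paper's, but the ordering (correct > abstain > wrong) is the essential property.

```python
# Hedged sketch of a ternary reward for a QA agent: prior work's binary
# correct/incorrect reward is extended so that abstaining ("no answer")
# earns its own intermediate reward. Values below are illustrative.

def ternary_reward(answer, gold_answer):
    if answer is None:           # agent chose not to answer
        return 0.0               # better than answering incorrectly...
    if answer == gold_answer:
        return 1.0               # correct answer: full reward
    return -1.0                  # ...which is penalized

print(ternary_reward("Paris", "Paris"))  # 1.0
print(ternary_reward(None, "Paris"))     # 0.0
print(ternary_reward("Lyon", "Paris"))   # -1.0
```

An agent trained against this signal learns to abstain when its confidence is low, which is exactly the precision-over-recall trade-off the abstract reports.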
Improving language modeling using densely connected recurrent neural networks
In this paper, we introduce the novel concept of densely connected layers
into recurrent neural networks. We evaluate our proposed architecture on the
Penn Treebank language modeling task. We show that we can obtain similar
perplexity scores with six times fewer parameters compared to a standard
stacked 2-layer LSTM model trained with dropout (Zaremba et al. 2014). In
contrast with the current usage of skip connections, we show that densely
connecting only a few stacked layers with skip connections already yields
significant perplexity reductions.
Comment: Accepted at the Workshop on Representation Learning, ACL 2017.
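The dense connectivity pattern the abstract refers to can be sketched as below: each layer consumes the concatenation of the input and all preceding layers' outputs, analogous to DenseNet but applied across stacked recurrent layers. The stand-in layers here are plain functions; a real model would use LSTM or GRU cells.

```python
# Minimal sketch of densely connected stacked layers: layer k receives the
# concatenation of the original input and the outputs of layers 1..k-1.
# Layers are stand-in callables over lists of floats, not real RNN cells.

def dense_stack(x, layers):
    features = [x]
    for layer in layers:
        # Concatenate every feature vector produced so far.
        concatenated = [v for f in features for v in f]
        features.append(layer(concatenated))
    # The final representation concatenates all layer outputs as well.
    return [v for f in features for v in f]

# Toy "layers" that collapse their input into a one-element feature:
layers = [lambda f: [sum(f)], lambda f: [sum(f)]]
print(dense_stack([1.0, 2.0], layers))  # [1.0, 2.0, 3.0, 6.0]
```

Because later layers see earlier features directly, the skip connections shorten gradient paths, which is one intuition for why a few densely connected layers can match a larger plain stack.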
A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders
Powerful sentence encoders trained for multiple languages are on the rise.
These systems are capable of embedding a wide range of linguistic properties
into vector representations. While explicit probing tasks can be used to verify
the presence of specific linguistic properties, it is unclear whether the
vector representations can be manipulated to indirectly steer such properties.
We investigate the use of a geometric mapping in embedding space to transform
linguistic properties, without any tuning of the pre-trained sentence encoder
or decoder. We validate our approach on three linguistic properties using a
pre-trained multilingual autoencoder and analyze the results in both
monolingual and cross-lingual settings.
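One simple geometric mapping consistent with this abstract is a mean-offset shift between two property classes in embedding space, applied without touching the encoder or decoder. Whether the paper uses exactly this mean-offset form is an assumption of this sketch.

```python
# Hedged sketch: steer a linguistic property by shifting a sentence embedding
# along the offset between the mean vectors of a source and a target class
# (e.g. singular -> plural). The frozen encoder/decoder are not shown.

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def transform(v, source_class, target_class):
    # v' = v + (mean(target_class) - mean(source_class))
    ms = mean_vector(source_class)
    mt = mean_vector(target_class)
    return [vi + (t - s) for vi, s, t in zip(v, ms, mt)]

# Toy 2-d example: source class clusters near (0, 0), target near (1, 1).
source = [[0.0, 0.1], [0.0, -0.1]]
target = [[1.0, 1.1], [1.0, 0.9]]
print(transform([0.2, 0.0], source, target))  # approximately [1.2, 1.0]
```

Feeding the transformed vector to the pre-trained decoder would then, ideally, produce a sentence with the target property, with no tuning of the autoencoder itself.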
Explaining Character-Aware Neural Networks for Word-Level Prediction: Do They Discover Linguistic Rules?
Character-level features are currently used in different neural network-based
natural language processing algorithms. However, little is known about the
character-level patterns those models learn. Moreover, models are often
compared only quantitatively while a qualitative analysis is missing. In this
paper, we investigate which character-level patterns neural networks learn and
if those patterns coincide with manually-defined word segmentations and
annotations. To that end, we extend the contextual decomposition technique
(Murdoch et al. 2018) to convolutional neural networks which allows us to
compare convolutional neural networks and bidirectional long short-term memory
networks. We evaluate and compare these models for the task of morphological
tagging on three morphologically different languages and show that these models
implicitly discover understandable linguistic rules. Our implementation can be
found at https://github.com/FredericGodin/ContextualDecomposition-NLP
Comment: Accepted at EMNLP 2018.
Zero-Shot Cross-Lingual Sentiment Classification under Distribution Shift: an Exploratory Study
The brittleness of finetuned language model performance on
out-of-distribution (OOD) test samples in unseen domains has been well-studied
for English, yet is unexplored for multi-lingual models. Therefore, we study
generalization to OOD test data specifically in zero-shot cross-lingual
transfer settings, analyzing performance impacts of both language and domain
shifts between train and test data. We further assess the effectiveness of
counterfactually augmented data (CAD) in improving OOD generalization for the
cross-lingual setting, since CAD has been shown to benefit in a monolingual
English setting. Finally, we propose two new approaches for OOD generalization
that avoid the costly annotation process associated with CAD, by exploiting the
power of recent large language models (LLMs). We experiment with 3 multilingual
models, LaBSE, mBERT, and XLM-R trained on English IMDb movie reviews, and
evaluate on OOD test sets in 13 languages: Amazon product reviews, Tweets, and
Restaurant reviews. Results echo the OOD performance decline observed in the
monolingual English setting. Further, (i) counterfactuals from the original
high-resource language do improve OOD generalization in the low-resource
language, and (ii) our newly proposed cost-effective approaches reach similar
or up to +3.1% better accuracy than CAD for Amazon and Restaurant reviews.
Comment: The 3rd Workshop on Multilingual Representation Learning (MRL@EMNLP 2023).
The Normalized Freebase Distance
In this paper, we propose the Normalized Freebase Distance (NFD), a new measure for determining semantic concept relatedness, based on the same principles as the Normalized Web Distance (NWD). We illustrate that the NFD is more effective when comparing ambiguous concepts.
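Since the NFD follows similar principles to the NWD, a sketch of an NWD-style distance over concept occurrence counts gives the flavor; here f(x) and f(x, y) would be (co-)occurrence counts over Freebase, and N the total number of indexed items. The exact NFD formula may differ from this NWD form.

```python
# NWD-style distance over (co-)occurrence counts:
#   d(x, y) = (max(log f(x), log f(y)) - log f(x, y))
#             / (log N - min(log f(x), log f(y)))
# Smaller values indicate more closely related concepts.
from math import log

def nwd_style_distance(fx, fy, fxy, n):
    num = max(log(fx), log(fy)) - log(fxy)
    den = log(n) - min(log(fx), log(fy))
    return num / den

# Strongly related concepts co-occur almost as often as they occur alone:
near = nwd_style_distance(1000, 800, 700, 10**9)
# Weakly related concepts rarely co-occur, giving a larger distance:
far = nwd_style_distance(1000, 800, 5, 10**9)
print(near < far)  # True
```

The normalization by log N is what makes distances comparable across corpora of different sizes, which is the property the NFD inherits when Freebase replaces web counts.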